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Abstract 

Background: Neglected tropical diseases, especially those caused by helminths, constitute some of the most 
common infections of the world's poorest people. Development of techniques for automated, high-throughput 
drug screening against these diseases, especially in whole-organism settings, constitutes one of the great 
challenges of modern drug discovery. 

Method: We present a method for enabling high-throughput phenotypic drug screening against diseases caused 
by helminths with a focus on schistosomiasis. The proposed method allows for a quantitative analysis of the 
systemic impact of a drug molecule on the pathogen as exhibited by the complex continuum of its phenotypic 
responses. This method consists of two key parts: first, biological image analysis is employed to automatically 
monitor and quantify shape-, appearance-, and motion-based phenotypes of the parasites. Next, we represent 
these phenotypes as time-series and show how to compare, cluster, and quantitatively reason about them using 
techniques of time-series analysis. 

Results: We present results on a number of algorithmic issues pertinent to the time-series representation of 
phenotypes. These include results on appropriate representation of phenotypic time-series, analysis of different 
time-series similarity measures for comparing phenotypic responses over time, and techniques for clustering 
such responses by similarity. Finally, we show how these algorithmic techniques can be used for quantifying 
the complex continuum of phenotypic responses of parasites. An important corollary is the ability of our 
method to recognize and rigorously group parasites based on the variability of their phenotypic response to 
different drugs. 

Conclusions: The methods and results presented in this paper enable automatic and quantitative scoring of high- 
throughput phenotypic screens focused on helmintic diseases. Furthermore, these methods allow us to analyze 
and stratify parasites based on their phenotypic response to drugs. Together, these advancements represent a 
significant breakthrough for the process of drug discovery against schistosomiasis in particular and can be 
extended to other helmintic diseases which together afflict a large part of humankind. 
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Background 

Neglected tropical diseases (NTDs) constitute the most 
common infections of the world's poorest people. This 
class of diseases encompasses a number of infection 
categories including helminth infections (schistosomia- 
sis, lymphatic filariasis, onchocerciasis), protozoan infec- 
tions (leishmaniasis, Chagas' disease, African 
trypanosomiasis), bacterial infections (cholera, leprosy, 
bovine tuberculosis), viral infections (dengue fever, 
rabies, yellow fever), fungal infections (Mycetoma, para- 
coccidiomycosis), and ectoparasitic infections (scabies, 
myiasis). Various studies have indicated NTDs to be the 
prime factors behind depriving the affected populations, 
especially women and children, of their health and eco- 
nomic potential [1,2]. This paper proposes a novel algo- 
rithmic approach to drug screening against 
schistosomiasis based on time-series analysis of pheno- 
types exhibited by parasites in response to different drug 
molecules. These phenotypes are themselves determined 
by automatically analyzing images from high-throughput 
screens. 

Schistosomiasis in humans is caused by three major 
species of trematodes, Schistosoma mansoni, Schisto- 
soma haematobium and Schistosoma japonicum. Affect- 
ing over 200 million people (with 20 million suffering 
severe effects) and placing over 800 million people at 
risk, schistosomiasis ranks second only behind malaria 
in terms of socio-economic and public health impact in 
developing countries. The disease starts as an inflamma- 
tory response to the eggs of the parasites and leads to 
fibrotic granulomatous, causing portal vein hypertension 
or occlusion (intestinal schistosomiasis caused by S. 
mansoni) or hydronephrosis and squamous bladder can- 
cer (urinary schistosomiasis caused by S. haematobium). 
Greatest infection intensities occur among children and 
adolescents, and the disease is known to undermine 
social and economic development in areas of high trans- 
mission [3-5]. Over the last 30 years, treatment and con- 
trol of schistosomiasis have come to rely on a single 
chemotherapeutic called praziquantel (PZQ). PZQ is 
given orally as a single-dose, has few side effects, and 
has bioactivity against the aforementioned three major 
species infecting humans. In spite of the advantages of 
PZQ, the reliance on a single drug to treat over 200 mil- 
lion people raises several serious problems. First, the 
emergence of drug resistance and possible drug failure 
is a major concern [3,6-8]. Indeed, increased parasite 
tolerance for the drug has been selected for in rodent 
hosts [9] and has been reported clinically in Egypt and 
sub-Saharan Africa [10-13]. Second, PZQ has important 
deficiencies in its therapeutic profile; cure rates vary 
usually between 60 - 90% [6-8]. Furthermore, the drug 
acts preferentially against the adult parasite, being mark- 
edly less effective (by 60 - 100%) against the juvenile 



schistosomula between 21 and 28 days old [14-16]. This 
decreased efficacy necessitates the re-treatment of indi- 
viduals harboring previously juvenile parasites and 
potentiates the risk for resistance by exposing (partially) 
refractory parasites to sub-curative doses [7,17]. The 
World Health Organization (WHO) has therefore 
declared schistosomiasis a disease for which new thera- 
pies are urgently needed [18]. 

Modern drug discovery conventionally begins by iden- 
tifying a molecular target (typically a protein or an 
enzyme) associated with a disease. Next, a large number 
of putative drug molecules are screened for activity 
against the target in in-vitro high-throughput screens 
(HTS) to identify "hits" which are passed onto later 
stages of the drug discovery pipeline for chemical opti- 
mization, optimization of the drug pharmacokinetics 
and pharmacodynamics, and ultimately clinical trials. 
The initial screening stage can typically involve a very 
large number of molecules (hundreds of thousands to 
millions), since even small variations in structure can 
significantly influence activity against the target. Given 
this context, we note that HTS platforms for Schisto- 
soma, using purified protein targets are almost 
unknown, with the very recent exception of a target- 
based campaign to identify inhibitors of S. mansoni 
thioredoxin/glutathione reductase (TGR) [19]. Tradi- 
tionally, researchers have tried to ameliorate this impe- 
diment by directly screening against the pathogen, in 
what can be termed as whole-organism screens. In such 
screens, typically a small number of molecules are tested 
by exposing the pathogen to them in vitro and the 
effects of the drug are captured using manual observa- 
tions. Against schistosomiasis, examples of such screens 
include [20-24] and have led to the discovery of drugs 
such as praziquantel and artemisinin [6]. 

The whole-organism screening approach differs from 
the conventional HTS-based strategy. HTS is built 
around the use of in-vitro single enzyme activity-based 
screens, single read-out cell-based assays, and involves 
very large number of molecules which are tested in par- 
allel using 96-, 384- or 1536-well plates. The distinctions 
of whole-organism screening from HTS, lead to both 
advantages and disadvantages. A crucial advantage is 
that the effect of a drug molecule can be studied in 
terms of the cumulative systemic effects it introduces in 
the parasite, rather than just in terms of how it interacts 
with a specific protein or enzyme in isolation. That is, 
the effects of the drug on the totality of targets and 
pathways can be explored in whole-organism screens. 
This can be expected to reduce the possibility of late- 
stage attrition of hits found through such screens. On 
the other hand, whole organism screens tend to be low 
throughput and are not easily extendable to HTS set- 
tings. This constrains, both in terms of diversity and 



Lee et al. BMC Genomics 2012, 13(Suppl 1):S4 
http://www.biomedcentral.com/1471-2164/13/S1/S4 



Page 3 of 24 



density, the exploration of the chemical space during 
lead-identification. Finally, as multi-cellular organisms, 
schistosomes display multiple and changing phenotypes 
in response to how compounds interfere with their nor- 
mal bio-chemical functioning, (see Figure 1). Capturing 
and quantifying the activity of a drug in terms of such 
rich and time-varying responses involves overcoming 
challenges of data processing, analysis, and modeling 
that are significantly more complex than those encoun- 
tered in single end-point assays common to biochemical 
target-based or cell-based screens [25]. 

Problem characteristics and proposed solutions 

An important long-term goal in the development of 
drugs against NTDs in general and schistosomiasis in 
particular, involves the development of high-throughput 
whole -organism screening methods. In the following, we 
enumerate some of the key challenges towards solving 
this problem and summarize the contribution of this 
paper towards addressing each of the challenges: 

1. Facilitating automated high-throughput data cap- 
ture and quantitative phenotype representation: The use 
of single end-point measurement of live or death' (e.g., 
ED 50 value) is over-simplistic when dealing with a 
multi-cellular and complex macro-parasites that can 
manifest a variety of temporally varying phenotypes. 
The need to screen compound libraries based on quan- 
tification of complex phenotypic responses of pathogens 
is also underlined by the fact that a drug may not neces- 
sarily lead to immediate death yet nonetheless perturb 
the parasite's ability to survive, e.g., through disruption 
of the larval migration program, tegumental perturba- 
tions releasing antigens targeted by the immune system, 
or the ability of adult worms to maintain position within 
the predilection site. As an example, the drug PZQ pro- 
duces both tetanic paralysis of the musculature, result- 
ing in loss of position as well as tegumental damage, 



and the exposure of surface proteins that then contri- 
bute to an immune system-mediated attack on the para- 
sites. We propose an image analysis-based approach for 
automatic segmentation and tracking of parasites and 
computation of descriptors that capture phenotypic 
responses in terms of changes in parasite shape, appear- 
ance, and motion. These descriptors are represented as 
time-series and provide a multi-dimensional time-vary- 
ing representation of parasite phenotypes. 

2. Analysis of phenotypes over time: We propose the 
use of time-series clustering to compare, differentiate 
and analyze the phenotypic response of parasites to dif- 
ferent drugs. Given the high-dimensional nature of the 
data and its sheer quantity, we investigate both numeric 
and symbolic representations of phenotypic time-series. 
We also investigate the applicability of different distance 
measures for comparing phenotypic responses. 

3. Dealing with variability of phenotypic responses: A 
challenge in analyzing data from whole-organism 
screens involving schistosomes is that they cannot yet 
be cloned. Furthermore, schistosome development in 
culture is asynchronous. Consequently, the response to 
a drug can show much greater variability than what is 
seen in molecular-target based or cell-based screening 
due to factors such as genetic variability, lack of syn- 
chronization, gender, stochastic noise, epigentics, and 
possibly the differentiated levels of resistance amongst 
individuals. Using the formalism of time-series cluster- 
ing, we show how this issue can be rigorously analyzed 
and examine the question of possible stratification of 
the phenotypic responses of different parasites that are 
exposed to the same drug molecule. 

4. Identification of representative phenotypic models: 
Given the inherent genotypic-phenotypic variability of 
parasites, identification of representative phenotypic 
models can be valuable for understanding the core 
trends in the data. From a statistical perspective, a 
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Figure 1 Examples of phenotypes exhibited by the schistosomula; (A) control (B) when exposed to the drug Lovastatin, and (C) when 
exposed to the drug Praziquantel (PZQ). 
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representative time-series can also be perceived as a 
description of the central tendencies in the data. We 
cast the problem of determining phenotypic models as 
that of finding representative time series for each phe- 
notype cluster in the data. We present different algorith- 
mic methods to address this question, and discuss case 
studies that show its usefulness for information sum- 
marization and presentation. 

Distinctions from prior research 

The use of quantitative phenotyping in biology and drug 
discovery has occurred along two directions. The first of 
these involves the study of phenotypic variations in 
model organisms such as C. elegans, as a function of 
gene-knockouts or changes in environmental conditions. 
The second direction involves dug screening based on 
cytological profiling. The distinctions of our problem 
formulations, methods, and results from research in 
both these directions are significant. Compared to the 
C. elegans mutants [26-30], the morphology and appear- 
ance of schistosomula are more complex and undergo a 
greater variety of changes when exposed to different 
compounds as clones of genotypically identical schisto- 
somes do not exist. Furthermore, our input data consists 
of images and video of multiple parasites residing in 
multi-well plates used in HTS. This leads to imaging 
conditions very different than those used for C. elegans 
(which typically involve large Petri-dishes with only few 
worms per dish). Our problem formulation shares the 
goal of lead identification with cytological screening 
[31-33]. However, our challenges are more acute and 
complex: the phenotypic response of (genetically 
diverse) parasites tends to be both more nuanced as 
well as more diverse than what is observed in cell-based 
screens. Furthermore, analysis of cellular phenotypes 
can use simplified models of cell shape. As we shall dis- 
cuss in the next section, such models cannot be easily 
obtained for Schistosoma, 

The investigations and results presented in this paper 
extend the framework proposed by Singh et al. for auto- 
mated phenotypic screening [34]. The method proposed 
in [34] used mean-shift-based segmentation & tracking 
of parasites. In it, phenotypes were classified using clas- 
sification and regression trees (CART). The method pre- 
sented in this paper differs from [34] in terms of the 
algorithmic approach used for parasite segmentation 
and tracking. Further, in this paper, we use the entire 
response of the parasite (represented as a time-series) 
for phenotype analysis. In [34] on the other hand, 
CART-based classification was performed on each indi- 
vidual frame of the video. Finally, to the best of our 
knowledge, neither time-series analysis, nor the specific 
problems investigated by us in terms of representation 
and analysis of phenotypic variability have been 



considered either for investigation of C. elegans pheno- 
types or for cytological profiling. 

Method 

Parasite identification by image segmentation 

In contrast to cellular segmentation, a topic that has 
received considerable attention in bio-image analysis, 
the problem of segmenting schistosomula in drug 
screens presents certain specific challenges. First, unlike 
cells, a geometric or appearance-based model of parasite 
shape cannot be assumed a priori. This is due to a num- 
ber of factors including the fact that the movement of 
the schistosomula is based on elongation and contrac- 
tion of their musculature. Therefore, the shape of the 
parasite's body can change considerably within a single 
movement cycle. Furthermore, drug action can distort 
the parasite body and appearance in unique manners (as 
can be seen in Figure 1). Second, parasites contain visi- 
ble "inner" anatomical structures that complicate seg- 
mentation by creating edges that do not correspond to 
the boundaries of the body. Third, parasites often touch 
each other in varying configurations. This often compli- 
cates identification of individuals. Finally, observation 
periods can be long (days) and debris in the background 
can accumulate leading to an increase in the false posi- 
tives. Our approach to parasite segmentation builds on 
prior research on segmentation of cells and parasites in 
biological images [34-36]. The basic idea involves a 
multi-step process, where the first step distinguishes 
between the foreground (corresponding to the parasites) 
and the background. In the second step, the foreground 
objects are analyzed and filtered to remove false posi- 
tives. In the final step, a series of morphological opera- 
tors are applied to separate individual parasites. In the 
following, we describe each of these steps in detail. 

In order to distinguish the background from the para- 
site, we modify and extend the region-based voting 
approach from [36]. First a low-pass filter is applied to 
the image to remove noise and smooth the image. Next 
an adjustable threshold y which is described later in this 
section is subtracted from the image. The result is 
asymptotically bounded using a sigmoid function to 
restrict values to between +1 and -1, resulting in pixel 
values below the threshold y being skewed towards the 
background because they have a higher weight. Subse- 
quently, in a manner similar to [36], a region-based dis- 
tributing function (RDF) R 1 defined in Eq.(l), is applied 
to the image. The purpose of this function is to rapidly 
identify the background by assigning a higher weight to 
background pixels. The functional form of the RDF used 
by us differs from the one proposed in [36] in two ways: 
first, Eq.(l) contains only a single parameter y and sec- 
ond, this parameter is interpreted in our approach as a 
threshold on the intensity difference between the 
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background and foreground. 

R 1 (n) = -sig{(f®h){n)-y) (1) 

y = min{(/ max - J min )/2, k} (2) 

In Eq.(l),/ denotes the image, h the low-pass filter, 0 
the convolution operator and sig(.) the sigmoid function. 
The starting threshold y is determined as per Eq.(2), 
where I max and I min denote the highest and lowest 
intensity values in the image. The value of the cutoff A 
in Eq.(2) was empirically established to be 50 based on 
an intensity range of [0, 255]. The region-based distri- 
buting function is iteratively applied with a decreasing 
threshold. As the threshold y decreases, that is, as the 
difference in the foreground and background intensities 
becomes smaller, the segmented results increasingly 
begin to include regions with intensity values closer to 
the background intensity. Initially, this increases the 
number of true positives (parasites) and progressively 
false positives occur. In the limit, the number of fore- 
ground objects decreases to one (as the entire image is 
treated as the foreground). Thus, a graph of the number 
of foreground objects has a left-shifted single mode. We 
found that tolerating a certain amount of false positives 
led to segmentations with the largest number of actual 
parasites correctly distinguished. Consequently, the 
threshold value which leads to the greatest increase in 
the number of segmented regions in the image is used 
to define y. The false positives generated as a conse- 
quence of the oversegmentation tend to be small regions 
(of less than 200 pixels) representing debris from the 
background and are removed using region-size-based fil- 
tering. An intensity-based filter is subsequently applied 
as part of the filtering process. In it, the average inten- 
sity of each foreground region is compared to the over- 
all average intensity of all the foreground regions. 
Regions with values outside of one and half times the 
standard deviation are considered background and 
removed. 

In the final step of the segmentation, morphological 
processing is employed to separate touching parasites. It 
begins by detecting the edges of the original image. For 
this, the Canny edge operator [37] is used and the 
detected edges are subtracted from the foreground 
image. A dilation-erosion step is next applied to remove 
internal gaps that might have occurred due to edge sub- 
traction. Next, relevant edge pixels are found and sub- 
tracted from the foreground. These pixels correspond to 
edges that separate connected components. In terms of 
image connectivity this means that every such edge 
pixel in the labeled image must have at least two differ- 
ent labels in its 8 neighbors. The removal of relevant 
edges results in separation of two different regions that 



are touching. The results from different stages of this 
method are shown in Figure 2 (top row). The first 
image in Figure 2 (bottom row) shows the final segmen- 
tation result. Other images in the bottom row illustrate 
the complexity of the segmentation problem by present- 
ing results obtained on this image using three well 
known segmentation techniques: mean-shift [38], JSEG 
[39], and active-mask [36]. As can be seen, all of these 
methods face difficulty in distinguishing touching 
parasites. 

Parasite tracking 

The ability to analyze time-varying phenotypic response 
of parasites requires tracking each parasite across the 
entire video sequence. Given an initial segmentation, for 
each parasite, this involves establishing a correspon- 
dence between its positions in successive frames. Once 
the parasites are tracked across the video, their appear- 
ance, shape, and motion can be described quantitatively. 
In designing a tracking system for the parasites in HTS, 
the following challenges have to be addressed: 

1. Robust handling of the erroneous or ambiguous 
segmentation of the parasites. Specifically, due to 
their tendency to mingle, errors in segmentation can 
result in clusters of parasites merging and splitting 
in a variety of ambiguous combinations (see Figure 3 
(a)). A closely related problem involves dealing with 
cases, where one parasite is incorrectly segmented 
into two regions. 

2. Precision in defining individual parasites, so that 
the phenotypes can be accurately measured over the 
entire duration of observation. This is especially 
important since we plan to use the entire phenotypic 
response of the parasite in our analysis. 

3. Accounting for the unique motion characteristics 
of the parasites; unlike many problems in vision- 
based tracking where the object being tracked moves 
rapidly, schistosomula can exhibit significant move- 
ment due to twists and turns of their bodies, without 
appreciable translation of their body positions. 

We design our tracking approach to consist of three 
conceptual levels: the segmentation level, the blob level, 
and the parasite level. The segmentation method 
described in the previous section constitutes the seg- 
mentation level. The blob level is based on a graph-the- 
oretic approach to track blobs proposed in [40]. This 
method offers a data-driven model for representing the 
merging and splitting behavior of regions. However, this 
approach assumes that objects can be reliably separated 
from the background based on motion and does not 
deal with issues arising out of segmentation errors that 
can occur in cluttered settings, such as when a single 
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Figure 2 Illustration of the segmentation results and comparison with other methods. Top row from left-to-right: The original image 
(note that the bottom right region has a shadow), results of the region-based distributing function showing oversegmentation, relevant edges, 
and the image after filtering of debris and small regions. Bottom row from left-to-right: Final results with the proposed method after closing 
and filling holes in regions and separation of touching parasites, results obtained by mean-shift segmentation [7], results of JSEG segmentation 
[9], and results from the active-mask segmentation method [35]. 



parasite is erroneously split across multiple regions or 
when multiple parasites are too close to be separated. 
We propose a novel approach to deal with these chal- 
lenges in the parasite level. In the following, we describe 
the blob and parasite levels. 

In the blob level, each distinct foreground region 
(putatively representing a parasite) is represented using 



its bounding box in the .ry-coordinate (image) plane. In 
the following, we call each such region a blob. Next, an 
undirected bipartite graph G/(V^£/) is used to model 
associations between blobs in consecutive frames such 
that V i =B i UB i _ 1 , where B t and B^j are sets of vertices 
corresponding to blobs in frames i and Specifically, 
an association between a blob u in frame i-1 and a blob 




Figure 3 (a) Parasites can be located in close proximity to each other in manners that lead to segmentation errors, (b) Bipartite graph 
describing the splitting of a blob (containing two parasites) into two blobs containing a single parasite each, (c) Group of four parasites, 
erroneously assigned to a single blob after segmentation. Analysis of the various combinations of intensity-boxes leads to the recovery of one of 
the four parasites from this blob. 



Lee et al. BMC Genomics 2012, 13(Suppl 1):S4 
http://www.biomedcentral.com/1471-2164/13/S1/S4 



Page 7 of 24 



v in frame i is represented as an edge (u, v) in G t . In 
this graph, two blobs that share an edge are called 
neighbors and the set of neighbors of a vertex (blob) u 
is denoted by N^u). Further, the number of associations 
of a blob u is described by its degree, deg(u). Figure 3 
(b)) illustrates an example graph of blob associations. 
The advantage of the blobs model is that it allows for 
the definition of many-to-many relationship between 
blobs throughout the video, such that blobs can merge, 
split, appear, or vanish depending on the information 
obtained from the segmentation level Following [40], 
two constraints are used to reduce the set of edges in 
Gf. the locality constraint, requiring that two blobs be 
sufficiently close to each other in order to be considered 
as neighbors and the parent structure constraint requir- 
ing that a blob not merge and split simultaneously. 
Finally, the correspondence between blobs is determined 
by enumerating the set of all edges between blobs in 
frame i-1 and those in frame i and ranking the resulting 
graphs to choose the most viable correspondence. To 
design a cost function for ranking graphs, two disjoint 
sets, called parents P t and descendants D it are defined 
Eq(3)-Eq(5). Of these, Eq.(3) and Eq.(4) are straightfor- 
ward. Eq.(5) captures the relationship that the parent set 
can be computed by taking all vertices of degree greater 
than one, all vertices of degree zero, and all vertices of 
degree one which are only in blob B^. 

Vi = PiUDi (3) 
Di = UNi(w), u e Pi (4) 

Pi = u\ deg(w) > 1 U v\ deg(y) = 0Uu;| deg(w) = 1 aw e (5) 

The cost function used to rank the graphs is shown in 
Eq.(6). In this equation, A(u) denotes the area of blob u, 
the summation corresponds to the total area occupied 
by the neighboring blobs of blob u, and P t denotes the 
parent set. This function penalizes graphs in which a 
significant change occurs between the sizes of corre- 
sponding blobs. As can be seen from this equation, in a 
perfect match, the blob size would be unchanged. 

A(u) - £ M v ) 

Cost(Gi) = V ^ (6) 

1 l) ^max{A(u),£A(i/)} 

v{u) = 1 if N > k * logA, v{u) = 0, otherwise (7) 

In the parasite level, our approach takes a different 
strategy than that proposed in [40] to be able to address 
issues occurring due to the closeness of parasites in the 
scene. The parasites level processing begins by assuming 
the association of exactly one parasite with each blob in 



the first frame of the video. Next, we iterate through the 
video and associate the parasites with their respective 
blobs in each frame. Specifically, if a parasite is asso- 
ciated with a blob in frame then it is also associated 
with all neighbors of that blob in frame L Recall that the 
neighbors of each blob are given by the blob bipartite 
graphs computed at the blobs level. At this point, we 
estimate the number and location of parasites occurring 
within groups, where the results from the segmentation 
layer may not be highly accurate, based on a notion we 
call the volatility factor and denote by v(u) for a blob u. 
The volatility factor v(u) is determined based on the 
fraction of frames in which the blob u participates in 
splits or merges (See Eq.(7)). In this equation, A denotes 
the area of the blob (in pixels), h is a coefficient, and N 
defines the threshold number of frames. If a blob parti- 
cipates in splits/merges in large number of frames 
(exceeding a number A/), then it is deemed to be vola- 
tile. Our hypothesis is that a blob having a high volatility 
factor is more likely to include erroneously segmented 
parasites than the one which has a low volatility factor. 
Operationally, to use this idea, we divide the blobs into 
three groups: small, medium, and large. Following Eq. 
(7), for small blobs, a split in approximately 20% of the 
frames was empirically determined to be the threshold 
for volatility. Similarly, for medium blobs, a split in 6% 
or more of the frames was used as the volatility thresh- 
old. Finally, for large blobs, a split in 1% of the frames 
was used as the threshold. Once the volatile blobs are 
identified, the tracker re-analyzes them starting with the 
first frame. To estimate the number and position of 
parasites in each blob, corresponding blobs from succes- 
sive frames are overlaid and the underlying intensity 
values are superimposed to create a heat-map represent- 
ing the possible locations of parasites. The bounding 
boxes of the blobs are then weighted and ranked based 
on the size of the box and the frequency of occurrence: 
a high scoring intensity box is expected to be close in 
size to the average bounding box of a single parasite. 
Excessively large intensity boxes are likely to contain 
multiple parasites, and very small intensity boxes are 
likely due to noise. The highest scoring intensity boxes 
are then considered in various combinations. For small 
blobs, combinations of two intensity boxes are consid- 
ered, for medium blobs the combinations include up to 
three intensity boxes. Finally for large blobs, combina- 
tions of up to four intensity boxes are analyzed. These 
combinations are ranked to favor: (1) configurations 
where the participating intensity boxes are reasonably 
spread out (minimal overlap), (2) configurations where 
the total area of the combined intensity boxes is close to 
the area of the original parasite, (3) configurations where 
the bounding box of the combined intensity boxes 
roughly matches the dimensions of the bounding box of 
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the original volatile blob. The highest ranked combina- 
tion represents the most likely dimensions and locations 
of the new parasites. Figure 3(c) shows an example of a 
group of 4 parasites that was erroneously segmented as 
a single region in the initial frame of the video. For this 
case, ranking the combination of intensity boxes allowed 
the recovery of 1 of the 4 parasites. 

Quantitative description of phenotypes 

In Table 1 we list the descriptors that capture the 
size, shape, motion (speed of movement), and appear- 
ance (intensity, color, and texture) of the parasites at 
every frame. These descriptors constitute the time- 
series representation of phenotypes used hereafter. 



Five classes of descriptors are used: size, shape, move- 
ment, texture, and color. Size descriptors measure the 
size of a parasite by the total number of pixels of the 
body of the parasite and the area change of the para- 
site between two consecutive time steps. The shape 
descriptor is calculated by dividing the length between 
two end points of the skeleton of the parasite by the 
skeleton length so that the ratio indicates how curved 
the parasite's body is. The movement of the parasite 
is measured by the pixel difference between two con- 
secutive time steps: large changes in parasite shape or 
position lead to large difference value. Finally, color 
and texture descriptors capture the parasite 
appearance. 



Table 1 Quantitative phenotype descriptors and their descriptions 



Descriptor name 


Formula 


Description 


Size 


Area 


See description 


The total number of pixels identified during segmentation. 


Change in area 


Area(t) - Area(M) 


The area of the parasite in the current frame at time t minus the area of the parasite in the 
previous frame (time t-1). 


Shape 


End point length/ 
Skeleton length 


See description 


Ratio of the Euclidean length of the shortest line between the two endpoints of the skeleton to 
the length of the skeleton. The skeleton is created by thinning the segmented region until it is 
represented by a line corresponding to the curve of the body. Branching of the skeleton is 
handled by iteratively applying the MATLAB spur operator that identifies and removes isolated 
edge points until only two edges remain [47]. 


Movement 


Image difference 


Image(M) - Image(t) 


The number of pixels that moved from time t-1 to t of the parasite. 


Perimeter (also for 
description of size) 


See description 


The number of pixels representing the boundary of the segmented region. 


Axis ratio (also for shape 
description) 


MinorAxisLength/ 
MajorAxisLength 


Ratio of the minor axis length to the major axis length. The major and minor axes are 
computed for an ellipse with the same normalized second central moments as the region. 


Texture 


Entropy 


-sum(p*\og 2 p) 


Statistical measure of randomness related to the texture of an image where p contains the 
grayscale histogram. 


Contrast 


m 2 pdj) 


The intensity contrast between a pixel and its neighbors throughout the region. 


Correlation 


i,j 1 


The intensity correlation between a pixel and its neighbors. 


Energy 




The sum of the squared elements in the GLCM (gray-level co-occurrence matrix). The GLCM 
measures how often two intensities occur side by side. 


Homogeneity 


y- P(i,)) 


Measures the closeness of the distribution of the elements in the GLCM to the GLCM diagonal. 


Color 


Average grayscale 


See description 


The mean intensity value and standard deviation found in the region. 


Average red 






Average green 






Average blue 






Grayscale histogram 


See description 


A histogram with bins 0-255 representing the count of each intensity value present in the 



region. 

Red histogram 
Green histogram 
Blue histogram 
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Time series analysis of phenotypes 

In analyzing the phenotypic responses of individual para- 
sites, our goal is to identify groups of similar phenotypic 
patterns. Conceptually, this problem requires clustering 
the phenotypes over time. However, when the time 
dimension is involved, the clustering problem becomes 
harder because each data point is not an individual 
instance but a sequence of data (collected over time). 
This implies that we are dealing with very high-dimen- 
sional data. Furthermore, given that our solution needs 
to work in high-throughput settings over large data sets, 
efficiency becomes a paramount consideration. Repre- 
senting a time series symbolically constitutes one of the 
well known ways of complexity reduction. This approach, 
called SAX, has also been shown to improve clustering 
due to the smoothing effect of dimensionality reduction 
[41]. We briefly review the SAX approach (Section "Sym- 
bolic representation of time series (SAX)"). Then we 
show why the idea of representing a time-series with a 
fixed number of piece-wise linear segments as done in 
SAX is not directly applicable to our situation and pro- 
pose a data-driven solution (Section "Automatic determi- 
nation of piecewise aggregate approximation"). We 
discuss different distance measures for comparing the 
time-series generated in our problem (Section "Definition 
of a similarity measure between time-series"). The results 
from the aforementioned sections allow us to formulate 
the clustering techniques we use in this work to quantita- 
tively analyze and differentiate the phenotypic response 
of parasites (Section "Clustering time series representa- 
tion of phenotypes"). Finally, given clusters of similar 
phenotypic responses, we consider the questions of 



finding representative phenotypic models (Section "Iden- 
tifying representative time series"). 
Symbolic representation of time series (SAX) 
SAX [41] represents a time series of arbitrary length n 
in a string of arbitrary length w (w <n, typically w «n). 
This method consists of two steps. The first step 
involves a normalization of the original data to N(0,1), 
and then a transformation of the normalized data into a 
piecewise aggregate approximation (PAA) is performed. 
In the second step, the PAA is converted into a discrete 
string. Figure 4 shows an example of how a given time 
series of length n is represented in a w-dimensional 
space using the above steps. Each segment in the PAA 
indicates the average of the data points (PAA coeffi- 
cient) along the segment. In this example, the length of 
the original data was 120 and the sequence was reduced 
to length 8. In the second step, a discrete representation 
is obtained based on producing symbols with equal 
probability. Given the normalized time series with Gaus- 
sian distribution from the first step, breakpoints are 
determined to divide the Gaussian curve into a equal- 
sized areas. These breakpoints are determined by look- 
ing up in a statistical table. Table 2 shows the break- 
points for values from 3 to 10. Once the breakpoints are 
obtained, a time series is discretized such that all PAA 
coefficients below the smallest breakpoint are mapped 
to the symbol "a", all coefficients greater than or equal 
to the smallest breakpoint and less than the second 
smallest breakpoint are mapped to the symbol "b", and 
so on. The three horizontal lines in Figure 4(b) repre- 
sent the three breakpoints that produce four equal-sized 
areas under the Gaussian curve. 




Figure 4 (a) Original HTS data showing the effect of the drug Imipramine in terms of "Area" for a parasite and its PAA representation, 
(b) the symbolic representation. The sequence of length 120 {n = 120) is reduced to 8 dimensions (w = 8) and is represented by the four SAX 
symbols {a = 4). X-axis denotes the frame number. The parasite body size (in pixels) is shown on the Y-axis. 
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Table 2 A lookup table that contains the breakpoints 



p 








a 












3 


4 


5 


5 


7 


8 


9 


10 


n 

Pi 


-0.43 


-0.67 


-0.84 


-0.97 


-1 .07 


-1 .15 


-1 .22 


-1 .28 


P2 


0.43 


0 


-0.25 


-0.43 


-0.57 


-0.67 


-0.76 


-0.84 


P3 




0.67 


0.25 


0 


-0.18 


-0.32 


-0.43 


-0.52 


P4 






0.84 


0.43 


0.18 


0 


-0.14 


-0.25 


P5 








0.97 


0.57 


0.32 


0.14 


0 


P6 










0.97 


0.67 


0.43 


0.25 


P7 
P8 












1.15 


0.76 
1.22 


0.52 
0.84 


P9 
















1.28 



Automatic determination of piecewise aggregate 
approximation 

Directly applying SAX to large and varied data common 
to HTS, requires properly estimating the two control 
parameters w and a. Through extensive experiments, we 
found that values of the alphabet size a = 5 or a = 6 led 
to representations that allowed for good discrimination 
without being overly influenced by noise. A similar 
observation was also reported in [41]. However, we 
could not find any strategy that could consistently esti- 
mate a good value of w (the number of segments) a 
priori. We therefore propose a strategy to determine w 
in a data-dependent manner based on a method that 
was originally proposed for shape representation using 
optimal polygonal approximation of digital points [42]. 
This method uses the L x norm to find the set of longest 
line segments, which fit the data with the minimum 
sum of absolute errors along each of the line segments. 
Given n data points and indices / and / such that i = 
l,...,n-2 and ; = i+2,...,n, an objective function is defined 
as: 

Fj = Lj-Ej, (8) 
where 

Lj = ylb-Xif + iYj-Yif (9) 
and 

£ j = ELL 'to ~ Yi ^ Xk ~ to ~ Xi ^ n ~ XiYj + x jy^iy/( x j- x ^ 2 + ^-y^ 2 ' (10) 

The goal of the objective function is to maximize the 
length of a line segment Lj and to minimize the error Ej 
simultaneously. Algorithm 1 is the pseudo code that 
finds a set of break-points P{Xj,yj) which produces opti- 
mal line segments from a given time series. Initially, / = 
1 and j starts from i+2. If < Fp then the value of I< is 
incremented. Otherwise, j~l becomes a breakpoint and a 
new segment is started with the two end points, i = j-1 
and / = i + 2. Since a time series is not a polygon but a 



curve along time dimension, we modify the original 
algorithm so that the start point of the new segment 
occurs after the end point of the previous segment (line 
20 in Algorithm 1). Figure 5 shows the optimal seg- 
ments and symbolic representation of the time-series 
from Figure 4. In it the number of segments can be 
seen to have increased from 8 to 32 and the length of 
each segment is determined solely by the data it 
represents. 

Algorithm 1. OptimalSegmentation 

l.i = l 

2. breakpoints <r- {} 

3. while i <n 

4. / = i + l 

5. if j == n 

6. break; 

7. end if 

8. compute Fj 

9. while / <n 



10. 7=7+1 

11. compute Fj 

12. if Fj < Fj 

13. Fj <r- Fj' 

14. else 

15. 7 ^ 7-1 

16. breakpoints <— breakpoints U/ 

17. break; 



18. end if 

19. end while 

20. i=j 

21. end while 

Algorithm 2. ExpandCluster 

1. seeds <— getNeighbors(Z), p, s) 

2. if size(result) >MinPts 

3. /?.clusterld <— none 

4. return False; 

5. else 



6. update /?.clusterld 

7. seeds <— delete(seeds, p) 

8. while ~isempty(seeds) 

9. currentP <— getFirstSeed(seeds) 

10. result <— getNeighbors(Z), currentP^ s) 

11. if size(result) >= MinPts 

12. for i from 1 to size(result) 

13. q <— get(result, i) 

14. if ^r.clusterld is unclassified or noise 

15. if ^r.clusterld is unclassified 

16. seeds <— append(seeds, q) 
Y7. end if 

18. update ^r.clusterld 

19. end if 

20. end for 

21. end if 

22. seeds <— delete(seeds, currentP) 
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Figure 5 Optimal segmentation and symbolic representation of the time-series from Figure 4(a). The frame number is shown on the X- 
axis and the parasite body size (in pixels) is shown on the Y-axis. 



23. end while 

24. return True 

25. end if 

Definition of a similarity measure between time-series 

Given two time series Q and C and their symbolic 
representations Q and q the distance between the two 
corresponding strings in SAX is given by Eq.(ll): 



m is the length of s and n the length of t, then comput- 
ing LD requires filling a (m+l)x(n+l) distance matrix D 
with a distance between each pair of characters of the 
two strings. Initially, the first row of the matrix is set to 
0,...,m and the first column of the matrix is set to 0,...,h. 
Next, for each cell d[i,j] in D, a value is computed as 
shown in Eq.(12). The final distance between the strings 
is given by d[m,ri\. 



MINDIST(Q, C) = J-JJ2 i=1 Ci)f (ID dM . min{d[i 



l,j],d[i,j- 1] + l,d[i- 1] + cost}, (12) 



The disti) function in Eq.(ll) is implemented by SAX 
using a lookup table defined in Table 3. For a given 
alphabet size a, the lookup table needs to be calculated 
only once. Figure 6 illustrates how two time series can 
be represented using symbols and how their similarity 
can be measured using SAX by matching the corre- 
sponding symbols. However, this method cannot be 
applied to our problem, since a data-driven PAA repre- 
sentation can map two time-series to symbolic represen- 
tations of different lengths. 

For comparing time-series as represented by symbolic 
representations of varying lengths, we investigate the 
following two distance measures: 

♦ Levenshtein distance measure (LD) [43]: Given a 
source string 5 and a target string t, the LD between the 
two strings is defined as the number of deletions, inser- 
tions, or substitutions required to transform s into t. If 



Table 3 A lookup table used by the MINDIST function, [a 
= 4) 





a 


b 


c 


d 


a 


0 


0 


0.67 


1.34 


b 


0 


0 


0 


0.67 


c 


0.67 


0 


0 


0 


d 


1.34 


0.67 


0 


0 



where 

i = 0, m,j = 0, n, and cos t ■■ 



ifs[i\ == t\j],0 
otherwise, V 



♦ Dynamic time warping (DTW): is an algorithm that 
finds an optimal alignment between two sequences. The 
warping path is found using dynamic programming 
using the following recurrence: 

r[i,j] = d[i,j] + min{r[i — l,j],r[i,j — l],r[i — l,j — 1]}, (13) 

where d[i,j] is the distance in the current cell and r[i,j] 
is the cumulative distance. 
Clustering time series representation of phenotypes 

We investigate two different clustering methods for 
clustering the time-series based description of pheno- 
types: agglomerative hierarchical clustering and 
DBSCAN. Agglomerative clustering is a well known 
method for grouping biological data. DBSCAN [44] is a 
density-based clustering method and was originally 
developed for large spatial databases with noise. This 
method finds clusters with arbitrary shape, that is, it is 
not restricted to spherical clusters. 

In DBSCAN, given an object p (time series in our 
case), an e-neighborhood is used to denote other objects 
(time series) that lie within a radius s of d. An object p 
is called a core object if it has a minimum number of 
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Figure 6 (a) Two original time series C and Q (b) PAA representations of the two original sequences using the uniform segmentation 
(c) The symbolic representations of the two PAAs from (b). X-axis denotes the frame number and the Y-axis denotes the body size (in 
pixels). 



objects, MinPts, in its ^-neighborhood. Given a set of 
objects, p is said to be directly density -reachable from a 
core object q, if p is within the ^-neighbor of q. Analo- 
gously, an object p is density -reachable from object q 
with respect to s and MinPts if there is a chain of 
objects p lt ... t p m pi = q, p n = p such that p i+1 is directly 
density- reachable from p b Finally, an object p is density- 
connected to an object q with respect to £ and MinPts if 
there is an object o such that both p and q are density- 
reachable from o with respect to s and MinPts, 

Our use of DBS CAN starts with an arbitrary time series 
p, and retrieves all density-reachable time-series from p 
with respect to £ and MinPts. To estimate these para- 
meters we analyzed the results from agglomerative clus- 
tering. The y-axis of the dendrogram is the distance 
between sequences. The linkages visually show which 
objects are closely located and helps users approximate 
the distances among objects. Therefore, in this paper, £ 
was set to the distance of two farthest objects within the 
cluster from the dendrogram. For the variable MinPts, 
we used the value of 2 or 3 not only to avoid singleton 
clusters but also to identify phenotypes that are distant 
from other phenotypes. In other words, if singleton is 
found, it means that the phenotype is truly distinct. If p 
happens to be a core object, then a cluster is generated. 
Algorithm 2 shows the pseudo code that checks whether 
p is a core object and expands a cluster if so. Initially, all 
objects are unclassified. For every unclassified object, 
ExpandCluster is called to see if the object is eligible to 
be a core object (line 2). If the object is indeed a core 
object, then all the neighboring objects within the radius 
£ are inserted into a queue (as they are directly density- 
reachable). For each object in the seeds any unclassified 
object in its ^-neighbors is inserted into the seed queue 
and is classified by being assigned a clusterld (line 18). By 
doing so, a cluster is formed by a set of density- 



connected objects that is maximally density-reachable. 
Every object not contained in any cluster is treated as 
noise. 

We use two data sets to investigate the applicability of 
hierarchical clustering and DBSCAN. The first data set is 
the synthetic control chart time series from the UCI 
machine learning repository. Because the class label of 
every sequence is known for this set, we use it for study- 
ing the validity of the resulting clusters. Three sequences 
of length 60 are selected from each of the three classes; 
normal, decreasing, and upward shift. Figure 7 shows 
these time series along with the optimally determined 
segments and the symbolic representations for each 
sequence. For the UCI data set, the dendrograms in Fig- 
ure 8 show the clusters found by the agglomerative hier- 
archical clustering using four distance functions; 
Euclidean, MINDIST, Levenshtein distance, and DTW. 
The original time series were used for Euclidean distance 
and the symbolic representations were used for the other 
distance functions. As the Figure 8(a) shows, the three 
normal trends were not properly identified when Eucli- 
dean distance was used. For the original SAX method, 
the two control parameters were exhaustively explored 
and all combinations of alphabet size and segment length 
were tried (a = [3-8], w = [5,10,15,20,30]). Among the 30 
combinations, the combination of w = 10 and a = 6 (Fig- 
ure 8(b)) provided the best clustering with respect to the 
ground truth. For Levenshtein distance (Figure 8(c)) and 
DTW (Figure 8(d)), the automatically determined opti- 
mal segments were used. Calculation of DTW was also 
employed to compute distance between two strings. 
DBSCAN also found the cluster membership of each 
sequence, clusterl = (1, 2, 3), cluster2 = (4, 5, 6) and clus- 
ters = (7, 8, 9). 

The second data used by us is a control group. For 
our experiment, we use the image difference descriptor 
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Figure 7 Optimal segmentation and symbolic representation of the sequences from the UCI database with a = 6. The X-axis denotes 
the time step and the sequence value is shown on the Y-axis. 



and study 22 parasites. This descriptor corresponds to 
the motion exhibited by a parasite; greater the motion, 
larger the image difference. Figure 9(a) shows the den- 
drogram constructed by the agglomerative hierarchical 
clustering using this descriptor. The clusters found by 



DBSCAN using DTW were cluster- 1 = (4, 12, 14, 16, 
17, 19, 20), cluster-2 = (1, 3, 9, 11, 13, 18), cluster-3 = 
(7, 8), cluster-4 = (2, 5, 6, 10, 15), and noise = (21, 22) 
with s= 0.0031 and MinPts = 3 for this data set. In 
Figure 9(a), cluster- 1 and cluster-2 are linked by the 
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Figure 8 Dendrograms constructed by the agglomerative hierarchical clustering for the UCI dataset. The X-axis denotes the sequence 
number and the Y-axis denotes values obtained using various distance measures. 
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Figure 9 Dendrograms constructed by agglomerative hierarchical clustering using values of the image difference descriptor: (a) 
Control group (b) Phenotypes measured on the 7 th day after exposed to drug Imipramine. The X-axis: denotes the sequence number 
and the Y-axis denotes distance. 



same color (red) but DBSCAN separated them with 
the two parameter setting (e= 0.0031 and MinPts = 3). 

For this data set, no significant difference was found 
in the results from DBSCAN and agglomerative hier- 
archical clustering, indicating the validity of the cluster- 
ing results of both methods. A manual inspection of the 
data confirmed this conclusion and the clusters that 
were obtained. 

The third data set captures the effect of the drug Imi- 
pramine after parasites had been exposed to it for seven 
days. DBSCAN found two clusters for this data set, clus- 
ter-1 = (1, 7, 9, 11, 13, 14) and cluster-2 = (2, 3, 4, 5, 6, 
8, 10, 11, 12, 15, 16) with e= 0.03 and MinPts = 3. Clus- 
ters that are similar to the two clusters found by 
DBSCAN can be found by cutting off the black and the 
topmost red links in Figure 9(b). It confirms that there 
was no significant difference in clusters found by the 
two different clustering methods, DBSCAN and agglom- 
erative hierarchical clustering. 
Identifying representative time series 

Finding a representative time series for a given cluster 
requires identifying one of the constituent time series 
which best characterizes the phenotypic diversity of the 
cluster. Different principles can be used to identify such 
a representative. In this paper, we propose two methods 
that approach this question from different perspectives. 
In the first method, we define the representative to be a 
time series that has the minimum sum-of-distances 
(MSD) with all the other time series in that cluster. We 
use DTW defined over the symbolic time-series repre- 
sentations to determine the representative. 

The second method used by us is based on the notion 
of a low dimensional vector called sketch, which has 
been used for discovering approximately repeating 



subsequence [45] . The sketch vector 5 of length k for a 
subsequence vector fof length n is defined as: 

«[*'] = M • v <[/] (14) 

In Eq.(13), i=l,..,k and v { is a random vector. Each ele- 
ment of Vi is an independent random variable with nor- 
mal distribution N(0,1) and the magnitude of the vector 
is normalized to 1. For example, 5 of length 2 for 
t = (3, 1,2,4) is computed as following: first, two ran- 
dom vectors are chosen and then normalized, 
Vi = (-0.6895, 0.1717, -0.2970, -0.8008) T and 
v 2 = (0.4708, -0.2446, -0.0450, 0.2179) T . Next, the 
sketch vector is computed using Eq. (13). In this exam- 
ple, S = (t-vi,t-v 2 ) = (-5.6940,1.9494). In our 
method, all the time series in a cluster are transformed 
to sketches of length 30. Once a sketch pool is obtained 
for the cluster, the L 2 norm is applied to identify a 
sketch with the least sum of distances to all other 
sketches and the time series corresponding to this 
sketch is declared as the cluster representative. 

In Figure 10 we show the representative time series 
identified by DTW and sketching for the two clusters 
obtained by cutting the links off around 0.003 in Figure 
9(a). As the reader may note, the movement of the 
representative parasite in clusterl is greater than that of 
the representative parasite from cluster 2, with both the 
methods. This behavior is consistent with manual obser- 
vations. In Figure 11, we show the shape change of the 
representative parasites from each of the two clusters 
(every 45 th frame in the observation period is depicted). 
Since the movement occurred over the same time-dura- 
tion, the reader can see greater motion exhibited by 
parasites from the first cluster. Since the parasite 
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Figure 10 Representative time series of the two clusters. Note that the magnitude of (a) and (c) are greater than that of (b) and (d). The 
frame number is depicted on the X-axis while the Y-axis denotes the change in area. 



identified using the sketch shows greater mobility than 
the one obtained using DTW+MSD, we postulate that 
sketching may be a better approach for finding repre- 
sentative time series. Figure 12 shows the representative 
time series identified by DTW and sketching for the 
two clusters obtained on the 7 th day after exposed to 
the drug Imipramine. Figure 13 shows the shape change 
of the representative parasites from each of the two 
clusters. 

Experiments 

In the following, we present a number of experiments 
and case studies to validate the proposed method and 
apply it to analyze data from phenotypic screens. The 
data used in this experiment was generated by screen- 
ing six compounds which are shown in Figure 14. 



These compounds were chosen from published whole- 
organism screening activities to reposition and poten- 
tially fast-track known drugs as therapy for schistoso- 
miasis. As visually interpreted using the constrained 
nomenclature from [20], these compounds elicited 
consistent, striking and disparate responses from schis- 
tosomula. These responses included: parasite hyper- 
motility (induced by the tricyclic neuroactive com- 
pounds chlorpromazine and imipramine), darkening 
and decreased motility leading to death (induced by 
the anti-hypercholesterolemia drugs: lovastatin, pravas- 
tatin and simvastatin). Finally, being the current ther- 
apy for schistosomiasis, PZQ, was also chosen. This 
compound generated a variety of dose-dependent 
responses not seen with the other compounds that 
included shrinking and degeneration combined with 
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Figure 11 Shape change in the parasites corresponding to the representative time series for each cluster. The snapshots depict 
parasites at the first frame and at every 45 th frame thereafter. As can be seen, based on the rate of shape change, the parasites in (a) and (c) are 
more active than those in (b) and (d). 



hyper-motility. The phenotypes analyzed by us in this 
paper were recorded 7 days after the exposure of the 
parasites to the drug. 

We begin by presenting results that quantify the accu- 
racy of the image segmentation and tracking. Next, 
experiments related to time series clustering and the 
identification of representative time series are presented. 
Included in this section are results from a case study 
which was conducted to compare the phenotypic 
response within two groups of parasites. The control 
constituted the first group while the second group was 
exposed to the current gold standard drug PZQ. An 
important result from this study was that the effect of 
PZQ could be stratified in terms of three distinct phe- 
notype clusters. Following this, in the section "Cluster 
identification using phenotypes from control and multi- 
ple compounds", the ability of the proposed method to 
automatically segregate phenotypes arising from the 
effect of different compounds is analyzed. A key goal of 
this analysis was to find the best combination of the 



alphabet size, the segment length and the distance func- 
tion for use in subsequent experiments. Results of using 
these parameters for classifying the phenotypes elicited 
by the six compounds are presented in Section "Cluster- 
ing of phenotypes elicited from all compounds". These 
results constitute proof of concept for the proposed 
method. Finally, in the section "Identification of repre- 
sentative time series" we present a case study that 
involves determining the representative phenotype mod- 
els for the controls as well as for parasites exposed to 
different compounds. 

Analysis of the accuracy of image segmentation and 
tracking 

To determine the effectiveness of the segmentation and 
tracking, we manually counted the number of parasites 
in five videos (See Table 4). The manual counts repre- 
sented the "ground truth" for this sample. Next, the 
ground truth was compared to the number of parasites 
which were correctly segmented and the number of 



Lee et al. BMC Genomics 2012, 13(Suppl 1):S4 
http://www.biomedcentral.com/1471-2164/13/S1/S4 



Page 1 7 of 24 




100 tSO 200 250 300 3KF 
(a)C]u5l«:l (DTW + MSD) 



1DG 150 ZIJ 250 300 350 

(b)Clu5JkT2fDTW + M8D) 



hi 




0 » 100 i» 200 2S0 MO i» 



(c) Clu-surl (sketch) (d) Clusierf (slceit h) 

Figure 12 Representative time series of the two clusters. The data was collected on the seventh day after exposure to the drug Imipramine. 
Note that the magnitude of (a) is greater than that of (b) and the magnitude of (c) is also greater than that of (d). In this figure the frame 
number is depicted on the X-axis and the change in the size of the parasites is shown on the Y-axis. 



parasites which were correctly tracked across the video. 
The third column of Table 4 (Segmented Parasites) 
shows the number of correctly segmented parasites in 
the first frame of the corresponding video. The next col- 
umn (Tracked Parasites) shows the number of parasites 
which were correctly tracked throughout the entire 
duration of the video. Overall, the tracker successfully 
tracked 224 out of 272 parasites (82.3%). What is nota- 
ble about this statistic is that across all the videos only 
206 out of 272 parasites (75.7%) were correctly segmen- 
ted on the first frame. Thus, these results highlight the 
fact that tracking can lead to improved parasite segmen- 
tation. Specifically, for the given data set, the overall 



improvement in parasite segmentation was about 6.6%. 
Additionally, there was a 10.6% false positive rate, due 
to errors in tracking. 

Data pre-processing and parameters employed in time- 
series analysis 

The video data was collected and analyzed using the 
methods described in Sections "Parasite identification by 
image segmentation" and "Parasite tracking". As is well 
known, real-world data tends to be noisy. To reduce the 
noise, each value of the descriptor in the time series was 
replaced by the mean of neighboring values within a 
sliding window and then outliers were smoothed out by 
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Figure 13 Snapshots depicting the shape change in the parasites corresponding to the representative time series for each cluster 
from Figure 12 at the first frame and every 45 th frame thereafter. 



a density-based local outlier detection method [46]. If an 
object does not have minimum number of neighbors 
within certain distance, the object is considered as an 
outlier. In all the experiments, the alphabet size used by 
us for transformation of the time-series to a symbolic 
representation was either 5 or 6. The video recordings 
were also manually analyzed by experts using the proto- 
col developed in [20] and qualitative descriptors were 
assigned to each video. These qualitative descriptors 
were used by us for independent validation of our 
results. 

Case study: analysis of phenotypes of control vs PZQ 

In this case study, we analyzed the phenotypic differ- 
ences exhibited by control parasites and those exposed 
to PZQ. Two clusters were identified for the control 
group. Clusterl was a singleton (n = 1) and cluster 2 
was the dominant cluster (n = 40). The standard devia- 
tions of cluster 2 (Figure 15(b)) at every 10 th frame were 
found to be nearly uniform. Figure 16 shows the shape 



change of the representative parasites of control group 
and the group exposed to PZQ. The reader may note 
that the shape descriptor for this data set not only gave 
us the shape information but also provided information 
regarding the frequency with which the parasite moved 
due to the fact that parasite is correlated with change in 
shape over time. The clustering result for the control 
group means that most parasites in this group had simi- 
lar shape over time. On the other hand, the presence of 
three clusters amongst the parasites exposed to PZQ 
seemed to indicate that the effects of the drug were 
manifested differently in different parasites. This is 
shown through the snapshots of the three clusters (b, c, 
and d in Figure 13). 

Cluster identification using phenotypes from control and 
multiple compounds 

The goal of this experiment was to find the best combi- 
nation of the alphabet size, the segment length and the 
distance function so as to separate the phenotypic 
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Figure 14 The structures of the six compounds used in the experiments. 
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Table 4 Image segmentation and tracking accuracy of five groups 


Compound 


Total 


Segmented 


Tracked 


False positives 


Segmentation 


Tracked 




parasites 


parasites 


parasites 




accuracy 


accuracy 


Pravastatin 


50 


33 


41 


3 


66% 


82% 


Simvastatin 


67 


58 


62 


5 


87% 


93% 


Imipramine 


53 


43 


39 


4 


81% 


74% 


PZQ 


42 


27 


31 


8 


64% 


74% 


Control 


60 


45 


51 


9 


75% 


85% 



response of parasites that were exposed to different 
compounds. We used the shape descriptor defined as 
the ratio of the end-point distance to the length of the 
skeleton. This descriptor is especially well suited to dis- 
tinguish parasites having normal shape from those that 
are straight. Four samples were selected from the three 
groups: control, PZQ and Lovastatin. The sampling was 
made in the following way. For each group, all of the 
time series were optimally segmented and then symboli- 
cally represented using our proposed method. Given the 
strings, clusters were found by the agglomerative hier- 
archical clustering and then four samples were selected 
from each one of the clusters. Four distance functions 
were tried by the agglomerative hierarchical clustering: 
MinDist of the original SAX method, edit distance, 
Euclidean distance and DTW. The distance for every 
pair of the symbolic representations of time series 



sequences was computed by each of the three distance 
functions and then the clusters of those time series 
sequences were identified based on the distances. When 
Euclidean distance was used, the raw data were used 
instead of the symbolic representation. A data set was 
formed by the total 12 time series from the three 
groups; Control = {1, 2, 3, 4}, PZQ = {5, 6, 7, 8}, Lovas- 
tatin = {9, 10, 11, 12}. Note that the parasites from the 
control group were slim and long, while those treated 
with PZQ & Lovastatin had curved and oval shapes 
respectively. By doing so, the data set was clearly sepa- 
rated and the ability of the method to distinguish the 
phenotypes could be tested unambiguously. Three dis- 
tance functions were tried to investigate the clustering 
accuracy. In Figure 17 we show the dendrograms (Figure 
17(l)-(6)) constructed by SAX [41]. For the original 
SAX method, the 36 combinations by the six alphabet 
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Figure 15 Representative time series and standard deviation at every 10* frame, a = 6. (a) Clusterl is singleton so its standard deviation 
is 0. It is therefore not shown in the plot. The cluster size is denoted by n. The X-axis denotes the frame number and the Y-axis denotes the 
ratio of the end point length to the skeleton length. 
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Figure 16 Shapes of the representative parasites over time. Starting from the first frame, the snapshots are taken every 45 
Representative of the dominant cluster in the control data, (b - d) The representative parasites from each of the three clusters found 
that was exposed to PZQ. The reader may note the distinct differences of the phenotypic response of the parasites in each of the cl 



th frame, (a) 
in the set 
usters. 



sizes (a = [3, 4, 5, 6, 7, 8]) and six segment lengths (w = 
[3, 7, 17, 21, 51, 119]) were tried. As the dendrograms 
show, the parasites from the control and PZQ groups 
were clearly identified but the parasites from the Lovas- 
tatin group were not clearly separated from the parasites 
from the control and PZQ groups. The various combi- 
nations of the six alphabet sizes and the six segment 
lengths did not make much difference in the clustering 
as observed in dendrograms in Figure 17. 

Next, the Levenshtein distance was applied to the 
symbolic representations of optimally segmented time 
series. The parasites from the control group were found 
to be well separated but the other two groups were not 
properly distinguished (Figure 17(7)-(12)). The results 
using the Euclidean distance are shown in Figure 17(13). 
The final distance function we investigated was DTW. 
The six different alphabet sizes were tried and for all of 
them, the three clusters could be clearly identified. As 
the results show (Figure 17(14)-(18)), the alphabet size 
did not affect the clustering. 

Clustering of phenotypes elicited from all compounds 

This experiment represented an extension of the one 
carried out in the previous section with the random 
choice of parasites. Here phenotypic responses of para- 
site exposed to each of the compounds were collected 
from video observations made on the 7 th day. Four para- 
sites were randomly selected from each of the seven 



groups including the control group as follows: Control = 
{1-4}, Chlorpromazine = {5 - 8}, Imipramine = {9 - 12}, 
Lovastatin = {13 -16}, Pravastatin = {17 -20}, PZQ = {21 
- 24}, and Simvastatin = {25 - 28}. Perceptually, the 
appearance of the parasites constituted the most signifi- 
cant phenotype. Consequently, in this experiment the 
average grayscale intensity was used as the descriptor. 
The agglomerative hierarchical clustering algorithm was 
employed using DTW distance. The results of the clus- 
tering along with the ground truth and a manual 
description of the parasite appearance are shown in 
Table 5. A very high accuracy of clustering was obtained 
(error rate of 0%) with respect to the ground truth. 

Identification of representative time series 

In this experiment, we further analyzed the data from 
Section "Cluster identification using phenotypes from 
control and multiple compounds", where a shape 
descriptor was used to cluster the phenotypes arising as 
a response to Lovastatin and PZQ in comparison to the 
control. The representative time series of each cluster 
(Figure 18) was found by sketching and DTW. The 
descriptor values for PZQ (Figure 18(b)) were amongst 
the smallest for the three drugs, indicating that the drug 
caused the parasite body to curl. The snapshots of the 
three parasites of the representative time series for the 
three groups are shown in Figure 19. In this figure, the 
clear difference in the shape change can be observed. 
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Figure 17 Dendrograms constructed by various distance functions, alphabet sizes and segment lengths. Control group = {1,2, 3, 4}, PZQ 
= {5, 6, 7, 8}, Lovastatin = {9, 10, 11, 12}. Dendrograms constructed using SAX and MinDist with varying alphabet sizes and varying number of 
segments are shown in parts (1-6). For each case, the six segments (w = [3, 7, 17, 21, 51, 1 19]) resulted in the same clusters. In parts (7-12) 
dendrograms constructed by optimal segmentation and edit distance are shown. The dendrogram constructed using Euclidean distance is 
shown in (13). In parts (14-18) dendrograms constructed using the proposed method are depicted. In our method, different alphabet sizes did 
not lead to differences in clustering accuracy. In this figure the sequence number is shown on the X-axis and the distance values are shown on 
the Y-axis. 



Table 5 Clusters obtained for phenotypes elicited from all compounds 


Compound 


Cluster 1 


Cluster 2 


Manual phenotype assignment 








(based on parasite appearance) 


Control 




1, 2, 3, 4 


Light 


Chlorpromazine 


5, 7 




Dark 






6, 8 


Light 


Imipramine 


9, 11, 12 




Dark 






10 


Light 


Lovastatin 


13, 14, 15, 16 




Dark 


Pravastatin 




17, 18, 19, 20 


Light 


PZQ 




21, 22, 23, 24 


Light 


Simvastatin 


25, 26, 27, 28 




Dark 
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Furthermore, DBSCAN was also clearly able to identify 
the membership of each time series with parameters 
MinPts = 2 and e = 0.012; {1, 2, 3, 4}, {5, 6, 7, 8} and {9, 
10, 11, 12}. This experiment along with its counterparts 
demonstrates that the proposed approach can be used 
for accurately grouping and quantifying phenotypic 
responses to different drugs. Furthermore, the method 
can also provide, automatically, the representative phe- 
notype models present in the data. 

Conclusions 

The research presented in this paper represents a signif- 
icant breakthrough towards quantitative phenotypic 
drug screening against neglected diseases, such as schis- 
tosomiasis, where the effect of a drug on the target 
pathogen is manifested through a continuum of 



complex phenotypes. The proposed method lies at the 
interface of disciplines. From the algorithmic perspec- 
tive, a key contribution of this work has been the adap- 
tation and extension of techniques from time-series data 
analysis for representation and reasoning about pheno- 
types exhibited by schistosomula. An important result 
from this perspective has been the development of a rig- 
orous approach to automatically quantify and character- 
ize the phenotypic responses different parasites to a 
drug. Consequently, the proposed method can be crucial 
for development of high-throughput phenotypic screens 
where a much larger fraction of the chemical space can 
be explored during lead discovery. Another important 
result lies in the ability of the method to detect and 
represent variability in the response of different parasites 
when they were exposed to the same drug in identical 




(a J Chis-lefL {Lo-vasLaiin) 




{b) Cluster^ (PZQ) 




(t) Cluster* fConUCl) 



Figure 19 Representative parasites for each of the three clusters. The figure shows snapshots that were captured starting at the first frame 
and at every 45 th frame thereafter. 
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environmental conditions. Recognizing such stratifica- 
tions in the parasite population may be significant in 
more ways than one. Among others, detection of such 
variability can play a major role in driving exploration of 
the pathogen's biology and in understanding the devel- 
opment of resistance to drugs. Furthermore, the exis- 
tence of such variability also underlines the need for 
developing new computational and statistical methods 
that can robustly analyze highly variable data from high- 
throughput phenotypic screens. 
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